31 July 2017

S0: Prior to lecture

Preparing for this lecture

  • Please run these codes on your laptop,
## Might be a while...
install.packages(c("ggplot2","dplyr", "readr","tidyr","janitor","plotly",
                   "devtools","learnr","gapminder")) 

library(devtools)
install_github("kevinwang09/2017_STAT3914", subdir = "learnr3914")
  • Familiar yourself with the iris dataset. Typing iris into R console should load this data. Pay attention to its column, row names and structure of each column.

S1: Necessary of Applied Statistics

Good statistical discoveries don't fall out from the sky

  • Statisticians are great at many things:

    1. Understanding data characteristics
    2. Building statistical/mathematical models
    3. Repeat 1 and 2…like…a lot…
    4. Extract insights
  • But the mother of all these, i.e. preparing data is not trivial. (e.g. STAT2xxx lab exams)

Let \(\boldsymbol{X}\) be the thing I want…

  • The real problem is not applying fancy shampoo for your cat. It is getting your cat into the bathtub.

Hidden side of being a statistician

  • Assume we have data
  • Assume we have questions, which the data can be used to answer
  • Assume we have cleaned data
  • Assume we interrogated the right aspects of the data using appropriate statistics
  • Assume we did everything right, communicate insights with others

Summary of this lecture

  • Passive learning is not going to work.
  • Our goal: clean the dirtyIris data to be exactly the same as the original iris data.
  • S1: Introduction
  • S2: Reading in data using readr and readxl
  • S3: Basic data cleaning using janitor
  • S4: Clean coding using magrittr
  • S5: Data filtering using dplyr
  • S6: Data visualisation using ggplot2
  • S7: Conclusion

S2: Reading data

Better read/write data

  • base R functions are not sufficient for modern uses.

  • readr functions are superior in data import warnings, column type handling, speed, scalability and consistency.

library(readr)

Reading data using readr (1)

dirtyIris = readr::read_csv("dirtyIris.csv")
## Parsed with column specification:
## cols(
##   SepAl....LeNgth = col_double(),
##   `Sepal.?    Width` = col_double(),
##   `petal.Length(*&^` = col_double(),
##   `petal.$#^&Width` = col_double(),
##   `SPECIES^` = col_character(),
##   allEmpty = col_character()
## )
class(dirtyIris) ## `tibble` is a `data.frame` with better formatting.
## [1] "tbl_df"     "tbl"        "data.frame"
  • readxl and haven (for SAS, SPSS etc) packages work similarly.

Reading data using readr (2)

dirtyIris
## # A tibble: 250 x 6
##   SepAl....LeNgth `Sepal.?    Width` `petal.Length(*&^` `petal.$#^&Width`
##             <dbl>              <dbl>              <dbl>             <dbl>
## 1      5.80000000                2.6                4.0               1.2
## 2      5.80000000                2.7                5.1               1.9
## 3      0.01874617                 NA                 NA                NA
## 4              NA                 NA                 NA                NA
## 5              NA                 NA                 NA                NA
## # ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
## #   allEmpty <chr>
  • We now proceed to data cleaning on the dirtyIris dataset.

Too trivial? Here is a short homework

Here is a dataset. Click here.

  1. Write 2 sentences about what is a .gmt file and who publishes this format?
  2. Which packages can read in .gmt files?
  3. How to download this package?
  4. What class is this data once read into R? Is it a data.frame?
  5. The data contains 50 different gene-sets. What is the size of each gene-set?
  6. What is the mostly frequent mentioned 6 genes?

S3: Cleaned data

What is clean data?

Clean data is a data set that allows you to do statistical modelling without extra processing

  1. Good documentation on the entire data.
  2. Each column is a variable. The name should be informative, and:
    • No bad characters/formatting @KevinWang009
    • No inconsistent capitalisation or separators (Cricket_australia vs cricket.Australia)
  3. Each row is an observation:
    • No bad characters
    • No poorly designed row names (3, 2, 5, … )
    • No repeated row names (a, a.1, b, b.1, … )

Data cleaning in R

  • Clean data is a well-designed data.frame.

  • Column type (esp. dates and factors) handling was the primary reason we used readr instead of base R when importing data.

  • Our goal: clean the dirtyIris data to be exactly the same as the original iris data.

    • Basic data cleaning using janitor package.
    • More advanced data manipulation through dplyr.

janitor: basic data cleaning

  • Clean up the bad column names
library(janitor)
library(dplyr)
glimpse(dirtyIris)
## Observations: 250
## Variables: 6
## $ SepAl....LeNgth  <dbl> 5.80000000, 5.80000000, 0.01874617, NA, NA, 6...
## $ Sepal.?    Width <dbl> 2.6, 2.7, NA, NA, NA, 3.0, NA, 2.9, NA, NA, 3...
## $ petal.Length(*&^ <dbl> 4.0, 5.1, NA, NA, NA, 5.5, NA, 5.6, NA, NA, 1...
## $ petal.$#^&Width  <dbl> 1.2, 1.9, NA, NA, NA, 1.8, NA, 1.8, NA, NA, 0...
## $ SPECIES^         <chr> "versicolor", "virginica", NA, NA, NA, "virgi...
## $ allEmpty         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## Clean up column names
better = clean_names(dirtyIris) 
glimpse(better)
## Observations: 250
## Variables: 6
## $ sepal_length <dbl> 5.80000000, 5.80000000, 0.01874617, NA, NA, 6.500...
## $ sepal_width  <dbl> 2.6, 2.7, NA, NA, NA, 3.0, NA, 2.9, NA, NA, 3.8, ...
## $ petal_length <dbl> 4.0, 5.1, NA, NA, NA, 5.5, NA, 5.6, NA, NA, 1.9, ...
## $ petal_width  <dbl> 1.2, 1.9, NA, NA, NA, 1.8, NA, 1.8, NA, NA, 0.4, ...
## $ species      <chr> "versicolor", "virginica", NA, NA, NA, "virginica...
## $ allempty     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

janitor: removal of empty rows and columns

  • Purely empty rows/columns are non-informative.
## Removing empty rows/columns
evenBetter = remove_empty_rows(better)
evenBetter = remove_empty_cols(evenBetter)

glimpse(evenBetter)
## Observations: 209
## Variables: 5
## $ sepal_length <dbl> 5.80000000, 5.80000000, 0.01874617, 6.50000000, 6...
## $ sepal_width  <dbl> 2.6, 2.7, NA, 3.0, 2.9, NA, 3.8, 2.8, NA, 3.0, 3....
## $ petal_length <dbl> 4.0, 5.1, NA, 5.5, 5.6, NA, 1.9, 4.8, NA, 4.2, 5....
## $ petal_width  <dbl> 1.2, 1.9, NA, 1.8, 1.8, NA, 0.4, 1.8, NA, 1.5, 2....
## $ species      <chr> "versicolor", "virginica", NA, "virginica", "virg...

janitor: removal of any rows with NA

  • Genuinely missing values should be retained, but in this case, the NA's were added. Only use na.omit when you 100% certain of the structure of your data.
evenBetterBetter = na.omit(evenBetter) 
cleanIris = evenBetterBetter
glimpse(cleanIris)
## Observations: 150
## Variables: 5
## $ sepal_length <dbl> 5.8, 5.8, 6.5, 6.3, 5.1, 6.2, 5.9, 6.8, 5.0, 6.7,...
## $ sepal_width  <dbl> 2.6, 2.7, 3.0, 2.9, 3.8, 2.8, 3.0, 3.2, 3.0, 3.0,...
## $ petal_length <dbl> 4.0, 5.1, 5.5, 5.6, 1.9, 4.8, 4.2, 5.9, 1.6, 5.0,...
## $ petal_width  <dbl> 1.2, 1.9, 1.8, 1.8, 0.4, 1.8, 1.5, 2.3, 0.2, 1.7,...
## $ species      <chr> "versicolor", "virginica", "virginica", "virginic...
glimpse(iris)
## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

S4: Clean coding (skim through)

Coding complexity increases with the number of brackets

  • The "inside out" structure of coding isn't great for human reading.
mean(cleanIris$sepal_length)
## [1] 5.843333
plot(density(cleanIris$sepal_length), col = "red", lwd = 2)

Piping: read code from left to right

  • We introduce a new notation: " x %>% f " means "f(x)". We call this operation as "x pipe f".

  • Compounded operations are possible. Keyboard shortcut is Cmd+shift+M.

cleanIris$sepal_length %>% mean
## [1] 5.843333
cleanIris$sepal_length %>%
  density %>%
  plot(col = "red", lwd = 2)

S5: dplyr: data subsetting master

Traditional way of subsetting data in R (1)

  • If I want the first two rows of column sepal_length and sepal_width in the cleanIris data:
## Assuming you know the position of column names.
## But what if you resample your data?
cleanIris[1:2, c(1, 2)]

## Assuming you know the position of column names.
## Also assuming the first two columns satisfy certain properties.
cleanIris[1:2, c(T, T, F, F, F)]

## Much better!
## What if you can't type out all the column names
## due to the size of your data?
cleanIris[1:2, c("sepal_length", "sepal_width")]

Traditional way of subsetting data in R (2)

  • Something more realistic: we want to extract rows based on some compounded criteria and select columns based on special keywords.
cleanIris[(cleanIris[,"sepal_length"] < 5) &
            (cleanIris[,"sepal_width"] < 3), c("petal_length", "sepal_length")]
## # A tibble: 4 x 2
##   petal_length sepal_length
##          <dbl>        <dbl>
## 1          1.3          4.5
## 2          1.4          4.4
## 3          3.3          4.9
## 4          4.5          4.9
  • (Optional) A pro R user might know about the subset function, but it suffers the same problem of not able to have multiple subsetting criteria without predefined variables.

Subsetting data using dplyr

  • Think of subsetting rows and columns as two separate different procedures:
  • select columns are operations on variables, and
  • filter rows are operations on observations

  • See dplyr cheatsheet.

library(dplyr)

cleanIris %>%
  filter(sepal_length < 5,
         sepal_width < 3) %>%
  select(contains("length"))
## # A tibble: 4 x 2
##   sepal_length petal_length
##          <dbl>        <dbl>
## 1          4.5          1.3
## 2          4.4          1.4
## 3          4.9          3.3
## 4          4.9          4.5

dplyr: mutate create new columns

iris_mutated = mutate(cleanIris,
      newVar = sepal_length + petal_length,
      newNewVar = newVar*2)

iris_mutated
## # A tibble: 150 x 7
##   sepal_length sepal_width petal_length petal_width    species newVar
##          <dbl>       <dbl>        <dbl>       <dbl>      <chr>  <dbl>
## 1          5.8         2.6          4.0         1.2 versicolor    9.8
## 2          5.8         2.7          5.1         1.9  virginica   10.9
## 3          6.5         3.0          5.5         1.8  virginica   12.0
## 4          6.3         2.9          5.6         1.8  virginica   11.9
## 5          5.1         3.8          1.9         0.4     setosa    7.0
## # ... with 145 more rows, and 1 more variables: newNewVar <dbl>

group_by + summarise will create summary statistics for grouped variables

bySpecies = cleanIris %>%
  group_by(species)

bySpecies
## # A tibble: 150 x 5
## # Groups:   species [3]
##   sepal_length sepal_width petal_length petal_width    species
##          <dbl>       <dbl>        <dbl>       <dbl>      <chr>
## 1          5.8         2.6          4.0         1.2 versicolor
## 2          5.8         2.7          5.1         1.9  virginica
## 3          6.5         3.0          5.5         1.8  virginica
## 4          6.3         2.9          5.6         1.8  virginica
## 5          5.1         3.8          1.9         0.4     setosa
## # ... with 145 more rows
bySpecies %>%
  summarise(meanSepalLength = mean(sepal_length))
## # A tibble: 3 x 2
##      species meanSepalLength
##        <chr>           <dbl>
## 1     setosa           5.006
## 2 versicolor           5.936
## 3  virginica           6.588

left_join for merging data

flowers = data.frame(species = c("setosa", "versicolor", "virginica"),
                     comments = c("meh", "kinda_okay", "love_it!"))

## cleanIris has the priority in this join operation
iris_comments = left_join(cleanIris, flowers, by = "species")
## Warning: Column `species` joining character vector and factor, coercing
## into character vector
## Randomly sampling 6 rows 
sample_n(iris_comments, 6) 
## # A tibble: 6 x 6
##   sepal_length sepal_width petal_length petal_width    species   comments
##          <dbl>       <dbl>        <dbl>       <dbl>      <chr>     <fctr>
## 1          5.8         2.7          5.1         1.9  virginica   love_it!
## 2          5.5         4.2          1.4         0.2     setosa        meh
## 3          5.0         3.3          1.4         0.2     setosa        meh
## 4          6.2         2.8          4.8         1.8  virginica   love_it!
## 5          5.0         3.4          1.5         0.2     setosa        meh
## 6          5.8         2.6          4.0         1.2 versicolor kinda_okay

arrange for ordering rows

arrangeCleanIris = cleanIris %>% 
  arrange(sepal_length, sepal_width, petal_length, petal_width)

## The true iris data
arrangeIris = iris %>% 
  clean_names() %>% 
  arrange(sepal_length, sepal_width, petal_length, petal_width)

Checking if we cleaned the data properly

  • We sorted both the processed dirtyIris data and the arranged iris data.
## The `Species` column is character or factor
all.equal(arrangeCleanIris, arrangeIris) 
## [1] "Incompatible type for column `species`: x character, y factor"
arrangeIris = arrangeIris %>% 
  mutate(species = as.character(species)) 

## Great! 
all.equal(arrangeCleanIris, arrangeIris)
## [1] TRUE

dplyr special select functions (advanced)

  • select only if a column satisfy a certain condition
cleanIris %>%
  select(starts_with("sepal")) %>% 
  top_n(3, sepal_width)
## # A tibble: 3 x 2
##   sepal_length sepal_width
##          <dbl>       <dbl>
## 1          5.5         4.2
## 2          5.2         4.1
## 3          5.7         4.4
bySpecies %>%
  summarise_if(is.numeric,
               funs(m = mean))
## # A tibble: 3 x 5
##      species sepal_length_m sepal_width_m petal_length_m petal_width_m
##        <chr>          <dbl>         <dbl>          <dbl>         <dbl>
## 1     setosa          5.006         3.428          1.462         0.246
## 2 versicolor          5.936         2.770          4.260         1.326
## 3  virginica          6.588         2.974          5.552         2.026

S6: ggplot2: the best visualisation package

ggplot2 tutorial sheet

ggplot2: the philosophy

  • Di Cook - the real reason that you should use ggplot2 is that, its design will force you to use a certain grammar when producing a plot.

  • \(\frac{1}{n}\sum_{i=1}^{n} X_i\) is a transformation of random variables, i.e., a statistic which provides insights into a data.

  • Similarly, ggplot is also a statistic, because we take components of the data and presented it in an informative way.

  • Publishing quality, rigourous syntax and design, flexible customisations, facetting.

ggplot example

library(ggplot2)
p1 =
  iris %>%
  ggplot(aes(x = Sepal.Length,
             y = Sepal.Width,
             colour = Species)) +
  geom_point(size = 3)

p1 

S7: Conclusion

tidy data, coding, modelling and reporting

  • tidyverse is a collection of 20+ packages built on the philosophy of being organised for the purpose of collaboration.

  • These functions:
    • Well designed programming and data science solutions.
    • They will always throw errors at you if you don't have a thorough understanding of your data.
    • Capable for functional programming.

Preview of next lecture

Interactive plotting from ggplot

library(plotly)
ggplotly(p2)

Advice in the future

  • Use RStudio + RMarkdown to document your codes.

  • Learn some computational tools. They are not statistics, but not learning them could inhibit your career aspects.

  • Find "cool" components and adapt those into your work routine. (Hint: start with all RStudio cheatsheets and build up gradually.)

  • Take time to re-analyse an old dataset.

  • Learn core functions and vignette.

  • Don't forget the theories and interpretations! This is a course about statistics after all, not Cranking-Out-Numbers-Less-Than-0.05-And-Reject-Null-Hypothesis-101.

Session Info and References

  • Dr. Garth Tarr
  • tidyverse.org
  • github.com/sfirke/janitor
  • gapminder.org
  • rstudio.com